Add HuggingfaceDatasetReader for using Huggingface datasets #5095
I'm looking at the lint and format issues and will hopefully address them today.
The new tests pass locally, but since they download data it is quite possible they fail here. I will try to pinpoint the cause. This should be okay for the draft, though.
It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?
I'm really excited to have this. I've been wanting to do this since HF came out with the datasets project.
Thanks for this PR, @pab-vmware, @divijbajaj, @annajung, @pkini-vmware and @agururajvais. And thanks @dirkgr for your thorough review. Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. 🤗
Nice idea. We can put together a custom dataset and test against it to validate the mapping of datasets.features to allennlp.data.fields. In the meantime, I noticed glue is small, so I will change the UTs to use only that. We can probably have sanity scripts, independent of the UTs, that validate support for specific important datasets. Let me make one such dataset as part of this. I will ask where to place it (my repo or somewhere else) once I have it.
This looks really good so far!
Looks like I need to make more changes to get it working even for glue. I will let you know once it is done.
Working for 3 datasets, with a couple of different types of features. Addressing the type check failures is still pending. Samples from 2 of the datasets; xnli is a little too big for a comment.
I have a working implementation with a limited features -> field mapping. Let me work on getting the code coverage up for the diff, even if I have to use a slightly heavier dataset at this point. I will try to increase the number of features that we map, but is the current change (once coverage is up) sufficient to go in?
That would be great!
What's the status of this? Is it close? Anything you need? I'd love to have this in, maybe even for the 2.3 release!
Hi Dirk, now that coverage is up to 90%, I think this MR can go in, although the implementation is still slightly raw. There are things left to do; I have added TODOs for most of them and will pick them up one by one in the coming weeks.
I have updated the datasets list and added validation tests, which are skipped for now. I also squashed all changes into a single commit. Please take a look when time permits. I will work on cleaning up the code and adding nested support.
Don't worry about doing that for our sake. It'll get squashed when we merge anyways. In fact, now I have no way of seeing just the changes you made since the last time I reviewed this.
* fix race condition when extracting files with cached_path * add warning when directory already exists
Bumps [checklist](https://github.com/marcotcr/checklist) from 0.0.10 to 0.0.11. - [Release notes](https://github.com/marcotcr/checklist/releases) - [Commits](https://github.com/marcotcr/checklist/commits) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com> Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
…5221) * ADD: add from_pretrained method for vocab * MOD: test format * MOD: format file * MOD: update changelog * MOD: fix bug * MOD: fix bug * MOD: fix typo * MOD: make the mothod in class * MOD: fix bug * MOD: change to instance method * MOD: fix typo * MOD: fix bug * MOD: change oov to avoid bug * Update allennlp/data/vocabulary.py * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update allennlp/data/vocabulary.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * MOD: fix formate * MOD: add test case * Update CHANGELOG.md * MOD: fix worker info bug * ADD: update changelog * MOD: fix format * Update allennlp/data/data_loaders/multitask_data_loader.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * Update CHANGELOG.md Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> * MOD: add demo code * MOD: align code * MOD: fix bug * MOD: fix bug * MOD: fix bug * MOD: formate code * Update allennlp/data/data_loaders/data_collator.py Co-authored-by: Pete <epwalsh10@gmail.com> * fix error * MOD: add test code * mod: change tokenizer * mod: fix tokenizer * MOD: fix bug * MOD: fix bug * MOD: fix bug * Update allennlp/data/data_loaders/data_collator.py Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com> * MOD: update changelog * MOD: update change log * Update allennlp/data/data_loaders/data_collator.py We should be using underscores for everything. * Formatting Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> Co-authored-by: Dirk Groeneveld <dirkg@allenai.org> Co-authored-by: Dirk Groeneveld <groeneveld@gmail.com>
Adding support for inputs to the backbone with more than 3 dimensions
* Removes unused variable * Formatting * Make sure we always restore the model's weights properly * Give TrainerCallbacks the ability to save and load state dicts * Give MovingAverage the ability to save and load state dicts * Do not set gradients to None * Typo * Remove unused variable * Typo * Entirely new checkpointing code * Formatting * Make mypy happy lol * Makes the no-op trainer work with the new checkpointer * Mark epochs as completed when they're skipped * Changelog * Fixes how we get the best weights after a training run * Mypy is annoying * Callback fixes * Fix the no op trainer * Simplify * Assorted checkpointer fixes * Mypy is now happy * Fixed all the tests except for one * Removed unused variable * Fix trainer restore logic * Fix test for trainer restore logic * Check the Checkpointing branch of the models repo * Help mypy along * Fixed finalizing logic * More mypy stuff * Update allennlp/training/checkpointer.py Co-authored-by: Pete <petew@allenai.org> * Make weaker claims Co-authored-by: Pete <petew@allenai.org>
* Implementing blocking repeated ngrams * Adding comment * Adding unit tests for the end to end beam search * Renaming class * Adding comment about function * Simplifying indexing to variable * Refactoring the state copying into the class * Reformatting * Editing changelog * fix line too long * comments * doc updates Co-authored-by: Pete <petew@allenai.org> Co-authored-by: epwalsh <epwalsh10@gmail.com>
* Make BeamSearch Registrable * Update changelog * Remove unused import * Update CHANGELOG.md Co-authored-by: Pete <petew@allenai.org> Co-authored-by: Pete <epwalsh10@gmail.com>
* initial commit * general self attn * fixing bugs, adding tests, adding docs * updating other modules * refactor * bug fix * update changelog * fix shape * fix format * address feedback * small doc fix * Update allennlp/modules/transformer/transformer_stack.py Co-authored-by: Pete <petew@allenai.org> * remove old file Co-authored-by: epwalsh <epwalsh10@gmail.com> Co-authored-by: Pete <petew@allenai.org>
* Fix tqdm logging into multiple files with allennlp-optuna * Update changelog * Add unittest for resetting tqdm logger handlers Co-authored-by: Pete <petew@allenai.org>
* bug fix * common lexicons * update changelog * Update CHANGELOG.md
* added linear and hard debiasers * worked on documentation * committing changes before branch switch * committing changes before switching branch * finished bias direction, linear and hard debiasers, need to write tests * finished bias direction test * Commiting changes before switching branch * finished hard and linear debiasers * finished OSCaR * bias mitigators tests and bias metrics remaining * added bias mitigator tests * added bias mitigator tests * finished tests for bias mitigation methods * fixed gpu issues * fixed gpu issues * fixed gpu issues * resolve issue with count_nonzero not being differentiable * added more references * fairness during finetuning * finished bias mitigator wrapper * added reference * updated CHANGELOG and fixed minor docs issues * move id tensors to embedding device * fixed to use predetermined bias direction * fixed minor doc errors * snli reader registration issue * fixed _pretrained from params issue * fixed device issues * evaluate bias mitigation initial commit * finished evaluate bias mitigation * handles multiline prediction files * fixed minor bugs * fixed minor bugs * improved prediction diff JSON format * forgot to resolve a conflict * Refactored evaluate bias mitigation to use NLI metric * Added SNLIPredictionsDiff class * ensured dataloader is same for bias mitigated and baseline models * finished evaluate bias mitigation * Update CHANGELOG.md * Replaced local data files with github raw content links * Update allennlp/fairness/bias_mitigator_applicator.py Co-authored-by: Pete <petew@allenai.org> * deleted evaluate_bias_mitigation from git tracking * removed evaluate-bias-mitigation instances from rest of repo * addressed Akshita's comments * moved bias mitigator applicator test to allennlp-models * removed unnecessary files Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local> Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-106.us-west-2.compute.internal> Co-authored-by: Arjun Subramonian 
<arjuns@ip-192-168-0-108.us-west-2.compute.internal> Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-1-108.us-west-2.compute.internal> Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com> Co-authored-by: Pete <petew@allenai.org>
Bumps [black](https://github.com/psf/black) from 21.5b1 to 21.5b2. - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/commits) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fixed broken link
…gins()` (allenai#5246) * ensure allennlp is a default plugin * fix logging issue * fixes * actually fix
* added BackwardCallback * finished tests * fixed linting issue * revised design per Dirk's suggestion * added OnBackwardException, changed loss to batch_ouputs, etc. Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
…o datasets_feature
Convert Dict to N ListFields for Dict of Lists
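The commit title above describes splitting a dictionary-valued feature whose values are lists into one list field per key. A minimal sketch of that idea in plain Python follows. The helper name `split_dict_of_lists` and the `parent.key` naming scheme are hypothetical, plain Python lists stand in for AllenNLP `ListField`s, the sample row is made up, and the PR's actual implementation may differ.

```python
from typing import Any, Dict, List


def split_dict_of_lists(name: str, value: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Flatten one dict-of-lists feature into N list-valued fields, one per key.

    Plain lists stand in for ListFields here; the "parent.key" field names
    are an assumption made for this illustration.
    """
    fields: Dict[str, List[Any]] = {}
    for key, items in value.items():
        if not isinstance(items, list):
            raise TypeError(f"expected a list under {name}.{key}, got {type(items).__name__}")
        fields[f"{name}.{key}"] = list(items)  # copy so the source row is not aliased
    return fields


# A made-up question-answering style feature: a dict whose values are parallel lists.
answers = {"text": ["first answer", "second answer"], "answer_start": [3, 17]}
fields = split_dict_of_lists("answers", answers)
```

Keeping one field per key means each list can later be indexed and padded independently, which is one reason a reader might prefer this over a single nested field.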
… moving it down the list.
Hi, |
Hi @wolhandlerdeb, please feel free to clone the changes and continue the work.
Added a new reader that allows reading huggingface datasets as instances
Mapped a limited set of datasets.features to allennlp.data.fields
Verified for selected datasets and/or dataset configurations for the training split, as listed in the documentation comments of the reader.
datasets==1.5.0
Joint work with @divijbajaj @annajung @prajaktakini-vmware @agururajvais
Signed-off-by: Abhishek P (VMware) pab@vmware.com
Fixes #4962
Changes proposed in this pull request:
Introduce a new reader that wraps huggingface datasets to provide instances for a split of the dataset with configuration if required
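
To illustrate the shape of such a reader, here is a minimal, dependency-free sketch. It is not the PR's code: `Value`, `ClassLabel`, `TextField` and `LabelField` below are simplified stand-ins for the real `datasets.features` and `allennlp.data.fields` classes, `to_field` and `read_split` are hypothetical names, and the two mappings shown are only one plausible subset of what a limited mapping could cover.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List


# Simplified stand-ins for datasets.features types (illustration only).
@dataclass
class Value:
    dtype: str


@dataclass
class ClassLabel:
    names: List[str]


# Simplified stand-ins for allennlp.data.fields types (illustration only).
@dataclass
class TextField:
    text: str


@dataclass
class LabelField:
    label: str


def to_field(feature: Any, raw: Any) -> Any:
    """Map one raw value to a field based on its declared feature type."""
    if isinstance(feature, ClassLabel):
        # Class labels are stored as integer ids; resolve them to their names.
        return LabelField(feature.names[raw])
    if isinstance(feature, Value) and feature.dtype == "string":
        return TextField(raw)
    # Anything outside the limited mapping is rejected explicitly.
    raise ValueError(f"unsupported feature type: {feature!r}")


def read_split(
    dataset: Dict[str, List[Dict[str, Any]]],
    features: Dict[str, Any],
    split: str,
) -> Iterator[Dict[str, Any]]:
    """Yield one instance (a dict of fields) per row of the requested split."""
    for row in dataset[split]:
        yield {name: to_field(features[name], row[name]) for name in features}


# Made-up schema and data, shaped like a small sentiment dataset.
features = {"sentence": Value("string"), "label": ClassLabel(["negative", "positive"])}
dataset = {"train": [{"sentence": "a fine film", "label": 1}]}
instances = list(read_split(dataset, features, "train"))
```

Raising on unmapped feature types, rather than silently skipping them, makes it obvious which datasets the limited mapping does not yet support.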